Hybrid probabilistic sampling with random subspace for imbalanced data learning

Authors

  • Peng Cao
  • Dazhe Zhao
  • Osmar R. Zaïane
Abstract

Class imbalance is one of the challenging problems for machine learning in many real-world applications. Other issues, such as within-class imbalance and high dimensionality, can exacerbate the problem. To address these issues, we propose a method, HPSDRS, that combines two ideas: a Hybrid Probabilistic Sampling (HPS) technique and a Diverse Random Subspace (DRS) ensemble. HPS improves on traditional re-sampling algorithms with the aid of a probability function, since simply manipulating the class sizes is not sufficient for imbalanced data with a complex distribution. The DRS ensemble, in turn, uses a minimum-overlapping mechanism to provide diversity, together with weighted voting, to improve generalization performance. The experimental results demonstrate that our method is efficient for learning from imbalanced data and can achieve better results than state-of-the-art methods for imbalanced data.
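The abstract does not give the algorithmic details, so the following is only a minimal sketch of the general ingredients it names: re-sampling, low-overlap feature subspaces, and weighted voting. It is not the authors' HPSDRS. Plain random over-sampling stands in for the probabilistic sampling step, and a greedy "least-used features first" rule stands in for the minimum-overlapping mechanism. All function names and parameters are hypothetical.

```python
import random
from collections import defaultdict


def random_oversample(X, y, minority_label, seed=0):
    """Duplicate random minority rows until the minority matches the rest.

    Assumption: plain random over-sampling, used here as a stand-in for
    the paper's probabilistic sampling, which the abstract does not specify.
    """
    rng = random.Random(seed)
    minority = [i for i, label in enumerate(y) if label == minority_label]
    if not minority:
        return list(X), list(y)
    deficit = max(0, (len(y) - len(minority)) - len(minority))
    extra = [rng.choice(minority) for _ in range(deficit)]
    keep = list(range(len(y))) + extra
    return [X[i] for i in keep], [y[i] for i in keep]


def low_overlap_subspaces(n_features, n_subspaces, subspace_size, seed=0):
    """Pick feature subsets, always preferring the least-used features.

    Assumption: a greedy heuristic for keeping overlap small, not the
    mechanism described in the paper.
    """
    rng = random.Random(seed)
    usage = [0] * n_features
    subspaces = []
    for _ in range(n_subspaces):
        # Sort by how often a feature was already used; break ties randomly.
        order = sorted(range(n_features), key=lambda j: (usage[j], rng.random()))
        chosen = sorted(order[:subspace_size])
        for j in chosen:
            usage[j] += 1
        subspaces.append(chosen)
    return subspaces


def weighted_vote(predictions, weights):
    """Return the label with the largest total weight across ensemble members."""
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)
```

In a full pipeline, one base classifier would be trained per subspace on the re-sampled data restricted to that subspace's features, and each member's weight could be, for example, its score on a validation set. That weighting choice is also an assumption.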


Similar resources

Ensemble-based hybrid probabilistic sampling for imbalanced data learning in lung nodule CAD

Classification plays a critical role in false positive reduction (FPR) in lung nodule computer aided detection (CAD). The difficulty of FPR lies in the variation in the appearance of the nodules and in the imbalanced distribution between the nodule and non-nodule classes. Moreover, the presence of inherent complex structures in data distribution, such as within-class imbalance and high-dimensionali...


ForesTexter: An efficient random forest algorithm for imbalanced text categorization

In this paper, we propose a new Random Forest (RF) based ensemble method, ForesTexter, to solve imbalanced text categorization problems. RF has shown great success in many real-world applications. However, the problem of learning from text data with class imbalance is a relatively new challenge that needs to be addressed. An RF algorithm tends to use a simple random sampling of features in b...


An Effective Method for Imbalanced Time Series Classification: Hybrid Sampling

Most traditional supervised classification learning algorithms are ineffective for highly imbalanced time series classification, which has received considerably less attention than imbalanced data problems in data mining and machine learning research. Bagging is one of the most effective ensemble learning methods, yet it has drawbacks on highly imbalanced data. Sampling methods are considered t...


A hybrid approach to learn with imbalanced classes using evolutionary algorithms

There is increasing interest in the application of Evolutionary Algorithms to induce classification rules. This hybrid approach can aid in areas where classical methods of rule induction have not been completely successful. One example is the induction of classification rules in imbalanced domains. Imbalanced data occur when some classes heavily outnumber other classes. Frequently, classical Mac...


CUSBoost: Cluster-based Under-sampling with Boosting for Imbalanced Classification

Class imbalance classification is a challenging research problem in data mining and machine learning, as most real-life datasets are imbalanced in nature. Existing learning algorithms maximise the classification accuracy by correctly classifying the majority class, but misclassify the minority class. However, the minority class instances are representing the concept with greater in...



Journal:
  • Intell. Data Anal.

Volume 18  Issue 

Pages  -

Publication date 2014